Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[xcvrd] Catch TypeError and KeyError exception to protect xcvrd from crashing #283

Closed
wants to merge 1 commit into from

Conversation

keboliu
Copy link
Collaborator

@keboliu keboliu commented Aug 11, 2022

Signed-off-by: Kebo Liu kebol@nvidia.com

Description

Catch TypeError and KeyError exception to protect xcvrd from crashing

Motivation and Context

This is to fix a crash issue: sonic-net/sonic-buildimage#11525

How Has This Been Tested?

run sonic-mgmt sfp reset test case: platform_tests/sfp/test_sfputil.py::test_check_sfputil_reset to make sure xcvrd will not crash

Additional Information (Optional)

@keboliu keboliu changed the title [xcvrd] Catch TypeError and KeyError exception to protect xcvrd from crash [xcvrd] Catch TypeError and KeyError exception to protect xcvrd from crashing Aug 11, 2022
@liat-grozovik
Copy link
Collaborator

@prgeor this is critical fix for 202205. can you please approve?

('supported_min_laser_freq', str(port_info_dict['supported_min_laser_freq'])
if 'supported_min_laser_freq' in port_info_dict else 'N/A')
])
except (KeyError, TypeError) as e:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see each field is checked if available in port_info_dict[] before accessing, so why KeyError is required? Also why TypeError req. if it just accessing the dictionary field value?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see each field is checked if available in port_info_dict[] before accessing, so why KeyError is required? Also why TypeError req. if it just accessing the dictionary field value?

This try to fix below crash

syslog.1:Jul 15 10:42:31.594198 r-leopard-56 INFO pmon#supervisord: xcvrd Process Process-2:
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd Traceback (most recent call last):
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     self.run()
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     self._target(*self._args, **self._kwargs)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 1679, in task_worker
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     rc = post_port_sfp_info_to_db(logical_port, self.port_mapping, self.xcvr_table_helper.get_intf_tbl(asic_index), transceiver_dict)
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/local/lib/python3.9/dist-packages/xcvrd/xcvrd.py", line 306, in post_port_sfp_info_to_db
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     fvs = swsscommon.FieldValuePairs(
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd   File "/usr/lib/python3/dist-packages/swsscommon/swsscommon.py", line 229, in __init__
syslog.1:Jul 15 10:42:31.598016 r-leopard-56 INFO pmon#supervisord: xcvrd     _swsscommon.FieldValuePairs_swiginit(self, _swsscommon.new_FieldValuePairs(*args))
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd TypeError: Wrong number or type of arguments for overloaded function 'new_FieldValuePairs'.
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd   Possible C/C++ prototypes are:
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector()
syslog.1:Jul 15 10:42:31.598074 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > > const &)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type)
syslog.1:Jul 15 10:42:31.598088 r-leopard-56 INFO pmon#supervisord: xcvrd     std::vector< std::pair< std::string,std::string > >::vector(std::vector< std::pair< std::string,std::string > >::size_type,std::vector< std::pair< std::string,std::string > >::value_type const &)

from this backtrace, I suppose at least one of the port_info_dict[] failed to fetch the value, either because of key error, or because none object of port_info_dict. I add the 'try' ... catch try to see what exactly the exception is, but the crash will disappear and no exception caught, if I remove the try..,catch I can see it again, I am not able to know the exact exception, so I have to add both of the exceptions here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOM table will be update with wrong values and user won't come to know this (i don't see anyone check the log unless any issue). Can we dig more which filed is causing the failure? I guess this is something to do with a specific optics

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor I do more tests today, I think it's not related to the cable, I have two testbeds with the same type of cable installed, but can only find issue on one of the testbeds, so I tend to believe this is a timing issue. I tried test to protect type and vendor_rev with change ('type', port_info_dict['type'] if type in port_info_dict else 'N/A') and ('vendor_rev', port_info_dict['vendor_rev'] if vendor_rev in port_info_dict else 'N/A') , seems I am not able to reproduce the crash anymore. So I would suggest catching KeyError, and skipping updating DB in case of exception, so xcvrd will be able to update the DB in the next iteration. What do you say?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor I do more tests today, I think it's not related to the cable, I have two testbeds with the same type of cable installed, but can only find issue on one of the testbeds, so I tend to believe this is a timing issue. I tried test to protect type and vendor_rev with change ('type', port_info_dict['type'] if type in port_info_dict else 'N/A') and ('vendor_rev', port_info_dict['vendor_rev'] if vendor_rev in port_info_dict else 'N/A') , seems I am not able to reproduce the crash anymore. So I would suggest catching KeyError, and skipping updating DB in case of exception, so xcvrd will be able to update the DB in the next iteration. What do you say?

@keboliu lets proceed with catching KeyError and also print the error message on keyError to know which field is giving error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keboliu do you mind using dictionary's get() method for each of these fields,? That would make the code more readable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see some more failures during the test, and need some more time to investigate.

@keboliu
Copy link
Collaborator Author

keboliu commented Aug 31, 2022

close this PR and open sonic-net/sonic-platform-common#305 to fix the crash

@keboliu keboliu closed this Aug 31, 2022
@keboliu keboliu deleted the xcvrd_exception_fix branch October 28, 2023 03:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants